Multimodal Machine Learning: Integrating Language, Vision and Speech

Authors

  • Louis-Philippe Morency
  • Tadas Baltrusaitis
Abstract

Multimodal machine learning is a vibrant multi-disciplinary research field which addresses some of the original goals of artificial intelligence by integrating and modeling multiple communicative modalities, including linguistic, acoustic and visual messages. With the initial research on audio-visual speech recognition and more recently with language & vision projects such as image and video captioning and visual question answering, this research field brings some unique challenges for multimodal researchers given the heterogeneity of the data and the contingency often found between modalities.


Related papers

On the Integration of Grounding Language and Learning Objects

This paper presents a multimodal learning system that can ground spoken names of objects in their physical referents and learn to recognize those objects simultaneously from naturally co-occurring multisensory input. There are two technical problems involved: (1) the correspondence problem in symbol grounding – how to associate words (symbols) with their perceptually grounded meanings from mult...
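The correspondence problem the snippet names — deciding which word goes with which perceived object when many of each co-occur — can be illustrated with a toy cross-situational counting scheme. This is a minimal sketch of the general idea only, not the paper's multisensory system; the scenes, objects, and normalization are invented for illustration:

```python
from collections import defaultdict

# Each "scene" pairs the words heard with the objects visible at the time.
# Co-occurrence counts, normalized by object frequency, gradually resolve
# which word names which object.
scenes = [
    (["red", "ball"], {"ball", "cup"}),
    (["blue", "cup"], {"cup", "ball"}),
    (["ball"], {"ball"}),
    (["cup"], {"cup", "dog"}),
    (["dog"], {"dog", "ball"}),
]

counts = defaultdict(lambda: defaultdict(int))
for words, objects in scenes:
    for w in words:
        for o in objects:
            counts[w][o] += 1

def best_referent(word):
    # Normalize by how often each object appears at all, so that
    # frequently visible objects do not dominate every word.
    obj_freq = defaultdict(int)
    for _, objects in scenes:
        for o in objects:
            obj_freq[o] += 1
    return max(counts[word], key=lambda o: counts[word][o] / obj_freq[o])

print(best_referent("ball"))  # -> ball
print(best_referent("dog"))   # -> dog
```

Even though "ball" also co-occurs with "cup" in one scene, the normalized counts single out the correct referent after a handful of observations.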


From Speech Recognition to Language and Multimodal Processing

While artificial neural networks have been around for over half a century, it was not until 2010 that they made a significant impact on speech recognition, through a deep form of such networks. This article, based on my keynote talk given at the Interspeech conference in Singapore in September 2014, will first reflect on the historical path to this transformative success, after providing brief...


Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading

Audio-visual Speech Recognition (AVSR), which employs both video and audio information for Automatic Speech Recognition (ASR), is one of the applications of multimodal learning that makes ASR systems more robust and accurate. Traditional models usually treated AVSR as inference or projection, but their strict priors limit their ability. With the revival of deep learning, Deep Neural Networks (DNN) be...
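The core idea the snippet describes — feeding time-aligned audio and visual features into a recurrent model — can be sketched as a single LSTM cell operating on concatenated modality features. This is a minimal NumPy illustration of feature-level fusion under invented dimensions and random placeholder weights, not the paper's actual architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: x is the fused input, (h, c) the recurrent state.
    All four gates are computed jointly, then split."""
    z = W @ x + U @ h + b
    n = h.shape[0]
    i = sigmoid(z[:n])          # input gate
    f = sigmoid(z[n:2 * n])     # forget gate
    o = sigmoid(z[2 * n:3 * n]) # output gate
    g = np.tanh(z[3 * n:])      # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
audio_dim, video_dim, hidden = 13, 32, 16  # e.g. MFCCs + lip-region features

# Placeholder parameters; a real AVSR system would learn these.
W = rng.standard_normal((4 * hidden, audio_dim + video_dim)) * 0.1
U = rng.standard_normal((4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
for t in range(5):  # five synthetic time-aligned frames
    audio_t = rng.standard_normal(audio_dim)
    video_t = rng.standard_normal(video_dim)
    fused = np.concatenate([audio_t, video_t])  # feature-level fusion
    h, c = lstm_step(fused, h, c, W, U, b)

print(h.shape)  # hidden state summarizing both modalities
```

The single hidden state now carries evidence from both streams, which is what lets visual cues compensate when the audio is noisy.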


Multimodal Adaptive Interfaces

Our group is interested in creating human-machine interfaces which use natural modalities such as vision and speech to sense and interpret a user's actions [6]. In this paper we describe recent work on multimodal adaptive interfaces which combine automatic speech recognition, computer vision for gesture tracking, and machine learning techniques. Speech is the primary mode of communication betwe...


Recognizing Unfamiliar Gestures for Human-Robot Interaction Through Zero-Shot Learning

Human communication is highly multimodal, including speech, gesture, gaze, facial expressions, and body language. Robots serving as human teammates must act on such multimodal communicative inputs from humans, even when the message may not be clear from any single modality. In this paper, we explore a method for achieving increased understanding of complex, situated communications by leveraging...
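Zero-shot recognition, as named in the snippet, typically works by matching a perceived input against semantic descriptions of classes rather than against training examples. This is a toy sketch of that general idea with invented attribute vectors, not the paper's method:

```python
import numpy as np

# Each gesture class is described by semantic attributes; an unfamiliar
# gesture is labeled with the class whose attributes best match those
# predicted for it, even with zero training examples of that class.
class_attrs = {
    "wave":  np.array([1.0, 0.0, 1.0]),  # [hand_raised, two_hands, repetitive]
    "point": np.array([1.0, 0.0, 0.0]),
    "clap":  np.array([1.0, 1.0, 1.0]),
}

def classify(predicted_attrs):
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(class_attrs, key=lambda c: cos(class_attrs[c], predicted_attrs))

# Attributes a perception model might predict for an unseen gesture.
print(classify(np.array([0.9, 0.1, 0.8])))  # -> wave
```

The matching is done in attribute space, so adding a new gesture class only requires writing down its attributes, not collecting new training data.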




Publication date: 2017